ABSTRACT


Knowledge tracing allows Intelligent Tutoring Systems to 
infer which topics or skills a student has mastered, thus ad- 
justing curriculum accordingly. Deep Knowledge Tracing 
(DKT) uses recurrent neural networks (RNNs) for knowl- 
edge tracing and has achieved significant improvements com- 
pared with models like Bayesian Knowledge Tracing (BKT) 
and Performance Factors Analysis (PFA). However, DKT is
not as interpretable as other models because the decision- 
making process learned by recurrent neural networks is not 
wholly understood by the research community. In this pa- 
per, we critically examine the DKT model, visualizing and 
analyzing the behaviors of DKT in high dimensional space. 
We modify and explore the DKT model and discover that 
Deep Knowledge Tracing has some critical pitfalls: 1) instead
of tracking each skill through time, DKT is more
likely to learn an ‘ability’ model; 2) the recurrent nature 
of DKT reinforces irrelevant information that it uses dur- 
ing the tracking task; 3) an untrained recurrent network can 
achieve similar results to a trained DKT model, supporting a 
conclusion that recurrence relations are not properly learned 
and, instead, improvements are simply a benefit of projec- 
tion into a high dimensional, sparse vector space. Based 
on these observations, we propose improvements and future 
directions for conducting knowledge tracing research using 
deep models. 
1. INTRODUCTION


Knowledge tracing has been investigated for decades. It al- 
lows Intelligent Tutoring Systems to infer which topics or 
skills a student has mastered, thus adjusting curriculum accordingly.
Two widely used models are Bayesian Knowledge Tracing (BKT) [2]
and Performance Factors Analysis (PFA) [11]. These models are designed in a way that each
parameter has a semantic meaning. For example, the guess 
and slip parameters in the BKT model reflect the probabilities
that a student could guess the correct answer and make
a mistake despite mastery of a skill, respectively. BKT at- 
tempts to explicitly model these parameters and use them 
to infer a binary set of skills as mastered or not mastered. 
In parallel with research in knowledge tracing models, deep 
neural networks have gained popularity in fields like Natural
Language Processing and Computer Vision [3, 9]. Piech
et al. proposed Deep Knowledge Tracing (DKT) [12], using
recurrent neural networks for knowledge tracing. The DKT 
model achieves significantly improved results compared to 
BKT and PFA. However, its mechanisms are not well un- 
derstood by the research community. That is, none of the 
parameters are mapped to a semantically meaningful measure,
which diminishes our ability to understand how DKT
performs predictions. There have been some attempts to
explain why DKT works well [8, 15], but these studies treat
the DKT model more like a black box, without studying the
state space that underpins the recurrent neural network. In 
this work, we analyze and visualize the learned state space 
of the DKT model to better understand its mechanisms. 


Recurrent neural networks can learn long-range dependencies
across many time steps. Long short-term memory (LSTM)
networks, gated recurrent unit (GRU) networks, and numerous
other variants that enhance vanilla RNNs in one way or
another have achieved empirical success [6, 1, 5]. However,
few works explain what is happening
under the hood. Karpathy et al. [7] provide a detailed analysis
of the behaviors of recurrent neural networks in language
processing and find that some neurons are responsible for 
long range dependencies like quotes and brackets. We take 
a similar approach for analyzing the DKT model. 


We aim to provide a better understanding of the DKT model 
and a more solid footing for using deep models for knowledge 
tracing. In this paper, we “open the box” of the DKT re- 
current architecture, visualizing and analyzing the behaviors 
of the DKT model in a high dimensional space. We track 
activation changes through time and analyze the impact of 
each skill in relation to other skills. We modify and explore 
the DKT model, finding that some irrelevant information 
is reinforced in the recurrent architecture. Finally, we find 
that an untrained DKT model (with gradient descent ap- 
plied only to layers outside the recurrent architecture) can 
be trained to achieve performance similar to that of a fully trained
DKT architecture. Based on our analyses, we propose im- 
provements and future directions for conducting knowledge 
tracing with deep recurrent neural network models.


2. RELATED WORK 


Bayesian Knowledge Tracing (BKT) [2] was proposed by 
Corbett et al. In their original work, each skill has its 
own model and parameters are updated by observing the 
responses (correct or incorrect) of applying a skill. Performance
Factors Analysis (PFA) [11] is an alternative method to
BKT and it is believed to perform better when each response 
requires multiple skills. Both BKT and PFA are designed in 
a way that each parameter has its own semantic meaning. 
For example, the slip parameter of BKT represents the probability
of getting a question wrong even though the student
has mastered the skill. These models are easy to interpret, 
but suffer from scalability issues and often fail to capture 
the dependencies between each skill because many elements 
are treated as independent to facilitate optimization. 


Piech et al. recently proposed the Deep Knowledge Tracing 
model (DKT) [12], which exploits recurrent neural networks 
for knowledge tracing and achieves significantly improved 
results. Piech et al. transformed the problem of knowledge
tracing by assuming each question can be associated with 
a “skill ID”, with a total of N skills in the question bank. 
The input to the recurrent neural network is a binary vector 
encoding of skill ID for a presented question and the correct- 
ness of the student’s response. The output of the recurrent 
network is a length N vector of probabilities for answering 
each skill-type question correctly. The DKT model could 
achieve >80% AUC on the ASSISTmentsData dataset [4], 
compared with the BKT model that achieves 67% AUC. 
This is an exciting result because it demonstrates the possi- 
bility of using neural networks for knowledge tracing. 
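
As an illustration of the input encoding described above, the following minimal sketch builds a binary vector from a skill ID and a response. It assumes a one-hot layout of length 2N in which the first N positions mark a correct response and the last N positions mark an incorrect one. The function names `encode_interaction` and `encode_sequence`, and this particular layout, are assumptions made for illustration; the text above only states that the input is a binary vector encoding of skill ID and correctness.

```python
def encode_interaction(skill_id, correct, num_skills):
    """Return a binary vector of length 2 * num_skills for one response.

    Assumed layout (hypothetical): index skill_id is set when the response
    is correct, index num_skills + skill_id when it is incorrect.
    """
    if not 0 <= skill_id < num_skills:
        raise ValueError("skill_id must be in [0, num_skills)")
    vector = [0] * (2 * num_skills)
    offset = 0 if correct else num_skills
    vector[offset + skill_id] = 1
    return vector


def encode_sequence(interactions, num_skills):
    """Encode a list of (skill_id, correct) pairs, one vector per time step."""
    return [encode_interaction(s, c, num_skills) for s, c in interactions]
```

Under this layout, the same skill answered correctly and incorrectly produces two different input vectors, so the network sees skill and correctness jointly at every time step.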


Despite the effectiveness of the DKT model, its mechanism is
not well understood by the research community. Khajah et 
al. investigate this problem by extending BKT [8]. They ex- 
tend BKT by adding forgetting, student ability, and skill dis- 
covery components, comparing these extended models with 
DKT. Some of these extended models achieve results close
to those of DKT. Xiong et al. discover that there
are duplicates in the original ASSISTmentsData dataset [15].
They re-evaluate the performance of DKT on different subsets
of the original dataset. Both Khajah et al.’s and Xiong et al.’s studies
are black-box oriented; that is, it is unclear how predictions
are performed within the DKT model. In our work, we try 
to bridge this gap and explain some behaviors of the DKT 
model. 


Trying to understand how DKT works is difficult because 
the mechanisms of RNNs are not totally understood even in 
the machine learning community. Even though the recurrent 
architecture is well understood, it is difficult to understand 
how the model adapts weights for a given prediction task. 
One common method is to visualize the neuron activations.
Karpathy et al. [7] provide a detailed analysis of the
behaviors of recurrent neural networks using character-level
models and find that some cells are responsible for long-range
dependencies like quotes and brackets. They break down
the errors and partially explain the improvements of using
LSTM. We use and extend their methods, providing a detailed
analysis of the behaviors of LSTM in the knowledge tracing 
setting. 


3. EXPERIMENT 


To investigate the DKT model, we perform a number of 
analyses based upon the activations within the recurrent 
neural network. We also explore different training proto- 
cols and clustering of the activations to help elucidate what 
is learned by the DKT model. 


3.1 Experiment setup 

In our analyses, we use the “ASSISTmentsData 2009-2010 
(b) dataset”, which was created by Xiong et al. after removing
duplicates [15]. Like Xiong et al., we also use LSTM units 
for analysis in this paper. Because we will be visualizing 
specific activations of the LSTM, it is useful to review the 
mathematical elements that comprise each unit. An LSTM 
unit consists of the following parts, where a sequence of inputs
$\{x_1, x_2, \ldots, x_T\} \in \mathcal{X}$ is ideally mapped to a labeled
output sequence $\{y_1, y_2, \ldots, y_T\} \in \mathcal{Y}$. The prediction goal is
to learn weights and biases ($W$ and $b$) such that the model
output sequence ($\{h_1, h_2, \ldots, h_T\} \in \mathcal{H}$) is as close as possible
to $y$:

\begin{align}
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{1} \\
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{2} \\
\tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \tag{3} \\
C_t &= f_t * C_{t-1} + i_t * \tilde{C}_t \tag{4} \\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \tag{5} \\
h_t &= o_t * \tanh(C_t) \tag{6}
\end{align}


Here, $\sigma$ refers to the logistic (sigmoid) function, $\cdot$ refers to
dot products, $*$ refers to element-wise vector multiplication,
and $[\,,]$ refers to vector concatenation. For visualization purposes,
we log the above six intermediate outputs for each input
during testing and concatenate these outputs into a single
“activation” vector, $a_t = [f_t, i_t, \tilde{C}_t, C_t, o_t, h_t]$. In the DKT
model, the output of the RNN, $h_t$, is connected to an output
layer $y_t$, which is a vector with the same number of elements
as skills. We can interpret each element in $y_t$ as an
estimate that the student would answer a question from each
skill correctly, with larger positive numbers denoting that the
student is more likely to answer correctly and more negative
numbers denoting that the student is unlikely to respond
correctly. Thus, a student who had mastered all skills would
ideally obtain a $y_t$ of all ones. A student who had mastered
none of the skills would ideally obtain a $y_t$ of all negative
ones.

Figure 1: First two components of t-SNE of the activation vector for first time step inputs. Numbers are skill identifiers; blue denotes a correct input and orange an incorrect input.

Figure 2: The prediction changes for one student over 23 steps. Correct inputs are marked blue and incorrect inputs are marked orange.
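
To make the logged quantities concrete, the following minimal sketch implements one LSTM step following Equations (1)-(6) and returns the concatenated activation vector. It is written in plain Python for readability and is not the implementation used in the experiments; the parameter dictionary layout and the helper `random_params` are hypothetical and exist only so the example can run.

```python
import math
import random


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def tanh(v):
    return [math.tanh(x) for x in v]


def affine(W, b, z):
    """Compute W . z + b for a weight matrix W (list of rows) and bias b."""
    return [sum(w * x for w, x in zip(row, z)) + bi for row, bi in zip(W, b)]


def lstm_step(params, h_prev, c_prev, x_t):
    """One LSTM step following Equations (1)-(6); returns (h_t, C_t, a_t)."""
    z = h_prev + x_t                                          # [h_{t-1}, x_t]
    f_t = sigmoid(affine(params["W_f"], params["b_f"], z))    # (1) forget gate
    i_t = sigmoid(affine(params["W_i"], params["b_i"], z))    # (2) input gate
    c_tilde = tanh(affine(params["W_C"], params["b_C"], z))   # (3) candidate
    c_t = [f * c + i * g                                      # (4) cell state
           for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    o_t = sigmoid(affine(params["W_o"], params["b_o"], z))    # (5) output gate
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]        # (6) hidden state
    a_t = f_t + i_t + c_tilde + c_t + o_t + h_t               # logged activations
    return h_t, c_t, a_t


def random_params(input_size, hidden_size, seed=0):
    """Hypothetical helper: small random weights for demonstration only."""
    rng = random.Random(seed)
    params = {}
    for gate in ("f", "i", "C", "o"):
        params["W_" + gate] = [
            [rng.uniform(-0.1, 0.1) for _ in range(hidden_size + input_size)]
            for _ in range(hidden_size)
        ]
        params["b_" + gate] = [0.0] * hidden_size
    return params
```

With a hidden size of H, the activation vector returned here has 6H elements, one block of H values for each of the six intermediate outputs.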


Deep neural networks usually work in high dimensional space 
and are difficult to visualize. Even so, dimensionality reduction
techniques can help to identify clusters. For example,
Figure 1 plots the first two reduced components (using
t-SNE [10]) of the activation vector, $a_t$, at the first time
step ($t = 0$) for a number of different students in the
ASSISTmentsData. The numbers in the plot are skill identifiers.
We use blue to denote a correct response and
orange to denote an incorrect response. After reducing
the dimensionality of the $a_t$ vector for each student,
we can see that the activations show a distinct clustering 
between whether the questions were answered correctly or 
incorrectly. We might expect to observe sub-clusters of the 
skill identifiers within each of the two clusters but we do 
not. This observation supports the hypothesis that correct 
and incorrect responses are more important for the DKT 
model than skill identifiers. However, perhaps this lack of 
sub-clusters is inevitable because we are only visualizing the 
activations after one time step—this motivates the analysis 
in the next section. 


3.2 Skill relations 


In this section, we try to understand how the prediction 
vector of one student changes as a student answers more 
questions from the question bank. Figure 2 plots the predic- 
tion difference (current prediction vector - previous predic- 
tion vector) for each question response from one particular 
student (steps are displayed vertically and can be read sequentially
from bottom to top). The horizontal axis denotes
the skill identifier and the color of the boxes in the heatmap
denotes the change in the output vector $y_t$. The initial row
in the heatmap (bottom) shows the starting values of $y_t$ for the
first input. As we can see, if the student answers correctly,
most of the $y_t$ values increase (warm color). When an incorrect
response occurs, most of the predictions decrease
(cold color). This makes intuitive sense. We expect a number
of skills to be related, so correct responses should add
value and incorrect responses should subtract value. We
can further observe that changes in the $y_t$ vector diminish if
the student correctly or incorrectly answers questions from
the same skill several times in a row. For example, observe
steps 14 to 19, where the student correctly
answers questions from skill #113; eventually the changes
in $y_t$ come to a steady state. However, we can
also notice that a correct response occasionally results in decreases in the
prediction vector (observe step 9). This behavior is difficult
to justify, as correctly answering a
question should not decrease the mastery level of other skills.
Yeung et al. have similar findings when investigating single 
skills [16]. Observe also that step 9 coincides with a tran- 
sition in skills being answered (from skill #120 to #113). 
Even so, it is curious that switching from one skill to another
would decrease values in $y_t$ even when the response
is correct. From this observation, one potential way to improve
the DKT model could be adding a penalty for such
unexpected behaviors (for example, in the loss function of
the recurrent network).

Figure 4: Activation vector difference of three randomly picked skills through time
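
The prediction differences discussed in this section, and one possible form of the penalty suggested above, can be sketched as follows. The penalty shown here (the total drop in any skill's prediction immediately after a correct response) is only one hypothetical choice; the text does not specify a form, and the names `prediction_differences`, `decrease_penalty`, and `weight` are assumptions for illustration.

```python
def prediction_differences(predictions):
    """Differences between consecutive prediction vectors (current - previous)."""
    return [
        [cur - prev for cur, prev in zip(curr_vec, prev_vec)]
        for prev_vec, curr_vec in zip(predictions, predictions[1:])
    ]


def decrease_penalty(predictions, responses, weight=1.0):
    """Hypothetical penalty: total drop in predictions after correct responses.

    predictions[t] is the prediction vector after the t-th response, and
    responses[t] is 1 if the t-th response was correct, else 0.
    """
    penalty = 0.0
    diffs = prediction_differences(predictions)
    # diffs[t - 1] holds the change in predictions caused by response t
    for t, diff in enumerate(diffs, start=1):
        if responses[t] == 1:
            penalty += sum(-d for d in diff if d < 0)
    return weight * penalty
```

In a training setting, a term of this kind would be added to the usual prediction loss, with `weight` controlling how strongly decreases after correct responses are discouraged.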


3.3 Simulated data 


From the above analysis, we see that from step 14 to step 19, the
student correctly answers questions from skill #113 and the
changes in $y_t$ diminish, perhaps an indication that the vector
is converging. Also, from Figure 2, we see that for each
correct input, most of the elements of $y_t$ increase by some
margin, regardless of the input skill. To have a better understanding
of this convergence behavior, we simulate how
the DKT model would respond to an Oracle Student, which 
will always answer each skill correctly. We simulate how 
the model responds to the Oracle Student correctly answer- 
ing 100 questions from one skill. We repeat this for three 
randomly selected skills. 


We plot the convergence of each skill using the activation
vector $a_t$ reduced to a two-dimensional plot using t-SNE
(Figure 3). The randomly chosen skills were #7, #8, and
#24. As we can see, each of the three skills starts from a
different location in the 2-D space. However, they each converge
to nearly the same location in space. In other words,
it seems DKT is learning one “oracle state” and this state 
can be reached by practicing any skill repeatedly, regard- 
less of the skill chosen. We verified this observation with 
a number of other skills (not shown) and find this behav- 
ior is consistent. Therefore, we hypothesize that DKT is 
learning a ‘student ability’ model, rather than a ‘per skill’ 
model like BKT. To make this observation more concrete,
Figure 4 plots the Euclidean distance between the current
time step activation vector, $a_t$, and the previous activation vector,
$a_{t-1}$. We can see the difference becomes increasingly small
after 20 steps. Moreover, the Euclidean distance between
the activation vectors reached from each skill becomes extremely
small, supporting the observation that not only is
the $y_t$ output vector converging, but all the activations inside
the LSTM network are converging. We find this behavior
curious because it means that the DKT model is not
remembering what skill was used to converge the network 
to an ‘oracle state.’ Remembering the starting skill would 
be crucial for predicting future performance of the student, 
yet the DKT model would treat every skill identically. We
also analyzed a process where a student always responds
incorrectly and found a similar phenomenon,
with convergence to an anti-oracle state.
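
The convergence measurements used in this section can be sketched as follows: the distance between consecutive activation vectors within one simulated run, and the distance between the final activation vectors of two runs that practice different skills. The function names and the tolerance `tol` are assumptions for illustration; the activation vectors themselves would come from logging a model, which is not shown here.

```python
import math


def step_distances(activations):
    """Euclidean distance between each activation vector and its predecessor."""
    return [math.dist(cur, prev)
            for prev, cur in zip(activations, activations[1:])]


def steps_to_converge(activations, tol=1e-3):
    """First step whose change falls below tol, or None if that never happens."""
    for step, d in enumerate(step_distances(activations), start=1):
        if d < tol:
            return step
    return None


def final_state_distance(run_a, run_b):
    """Distance between the last activation vectors of two simulated runs."""
    return math.dist(run_a[-1], run_b[-1])
```

A small `final_state_distance` between runs that practice different skills would correspond to the shared ‘oracle state’ described above.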


Figure 5 shows the skills prediction vector after answering 
correctly 20 times in a row. We can see the predictions of 
most skills are above 0.5, regardless of the specific practice 
skill used by the Oracle Student. This supports the view
that the DKT model is not really tracking the mastery level
of each skill; it is more likely learning an ‘ability model’ from
the responses. Once a student is in this oracle state, DKT
will assume that the student will answer most of the questions
correctly from any skill. We hypothesize that this behavior
could be mitigated by using an “attention” vector during
the decoding of the LSTM network [13]. Self attention in 
recurrent networks decodes the state vectors by taking a 
weighted sum of the state vectors over a range in the se- 
quence (weights are dynamic based on the state vectors). 
For DKT, this attention vector could also be dynamically 
allocated based upon the skills answered in the sequence, 
which might help facilitate remembering long-term skill de- 
pendencies. 
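
The weighted sum described above can be sketched as follows. This is a generic dot-product attention over a list of state vectors, chosen for brevity; it is an assumption made for illustration and not necessarily the scoring function used in [13]. The names `attend` and `query` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(states, query):
    """Weighted sum of state vectors; weights come from dot products with query."""
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    context = [sum(w * state[k] for w, state in zip(weights, states))
               for k in range(dim)]
    return context, weights
```

For knowledge tracing, the query could be derived from the skill of the next question, so that earlier states from the same skill receive larger weights.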


3.4 Temporal impact 

RNNs are typically well suited for tracking relations of in- 
puts in a sequence, especially when the inputs occur near 
one another in the sequence. However, long range depen- 
dencies are more difficult for the network to track [13]. In 
other words, the predictions of RNN models will be more im- 
pacted by recent inputs. For knowledge tracing, this is not 
a desired characteristic. Consider two scenarios as shown
below. For each scenario, the first line gives the skill numbers
and the second line gives the responses (1 for correct and 0 for
incorrect). Both scenarios have the same number of
attempts for each skill (4 attempts for skill #9, 3 attempts
for skill #6 and 2 attempts for skill #24). Also, the ordering
of correctness within each skill is the same (e.g., 1, 0, 0, 0).

Figure 5: Prediction vector after 20 steps for skill #7, #8, #24

Figure 6: DKT predictions from two different students. The blue line is the prediction of correctness from DKT. The red line is the actual response correctness (1 or 0).

For models like BKT, there is a separate model for each skill.
Thus, the relative order of different skills presented has no
influence, as long as the ordering within each skill remains
the same. In other words, for each skill the ordering of correct
and incorrect attempts remains the same, but different
skills can be shuffled into the sequence. BKT will
learn the same model from these two scenarios, but this may
not be the case for DKT. The DKT model is more likely to
predict an incorrect response after seeing three incorrect inputs
in a row because it is more sensitive to recent inputs in the
sequence. This means that, for the first scenario, the first attempt
at skill #24 (in bold) is more likely to be predicted incorrect
because it follows three incorrect responses. For the second
scenario, the first attempt at skill #24 (in bold) is more likely to
be predicted correct. Thus the DKT model might perform
differently on the given scenarios.


Figure 6 gives two typical excerpts from the real dataset for 
two students. In the top example, after several correct in- 
puts, the DKT model has a high probability of predicting 
the next item correct, regardless of the skill (70%). Simi- 
larly, in the bottom example, after several incorrect inputs, 
the DKT model has a low probability of predicting the next 
item correct (8%), regardless of the skill. That means, if a
student has mastered an easy skill previously but then fails
three attempts of more difficult exercises, the DKT model would
predict that the student would also fail the already mastered
skill. We give only two samples here due to limited
space, but this kind of behavior is universal across students,
as we discuss next. Again, we hypothesize that
this behavior could be mitigated by using an “attention” vector
that allows the DKT model to use the whole weighted history
as additional inputs.


Table 1: Area under the ROC curve 

Khajah et al. also alluded to this recency effect in [8]. In
this paper, we examine this phenomenon in a more quan- 
titative way. We shuffle the dataset in a way that keeps 
the ordering within each skill the same, but spreads out the 
responses in the sequence. This change should not change 
the prediction ability of models like BKT. The results are 
shown in Table 1 and Table 2 using standard evaluation cri- 
teria for this dataset. All results are based on a five-fold 
cross validation of the dataset. When comparing DKT on
the original dataset to the “spread out” dataset ordering, we
see that the relative ordering of skills has a significant negative
impact on the performance of the model. From these
observations, we see that the behavior of DKT is more like PFA,
which counts prior frequencies of correct and incorrect attempts,
than like BKT, and that the design of the exercises could
have a huge impact on the model (for example, the arrangement
of easy and hard exercises).
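
One way to produce a “spread out” ordering of the kind described above is sketched below. It interleaves skills in round-robin fashion while preserving the order of responses within each skill. The exact shuffling procedure used for the experiments is not specified in the text, so the round-robin rule and the function names are assumptions for illustration.

```python
from collections import OrderedDict, deque


def spread_out(sequence):
    """Reorder (skill_id, correct) pairs round-robin across skills.

    The order of responses within each skill is preserved; only the
    relative order of different skills changes.
    """
    queues = OrderedDict()
    for skill_id, correct in sequence:
        queues.setdefault(skill_id, deque()).append((skill_id, correct))
    result = []
    while queues:
        for skill_id in list(queues):
            result.append(queues[skill_id].popleft())
            if not queues[skill_id]:
                del queues[skill_id]
    return result


def per_skill_order(sequence):
    """Responses grouped by skill, in the order they occur."""
    grouped = {}
    for skill_id, correct in sequence:
        grouped.setdefault(skill_id, []).append(correct)
    return grouped
```

Because `per_skill_order` returns the same result for the original and the reordered sequence, a model with a separate component per skill, such as BKT, would see identical per-skill data under both orderings.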


3.5 Is the RNN representation meaningful? 
Recurrent models have been successfully used in practical 
tasks like natural language processing [3]. These models 
can take days or even weeks to train. In a recently published
paper, Wieting et al. [14] argue that RNNs might not
be learning a meaningful state vector from the data. They
show that a randomly initialized RNN model (with only $W_y$
and $b_y$ trained) can achieve similar results to models where
all parameters are trained. This result is worrying because
it may indicate that the RNN performance is due mostly 
to simply mapping input data to a random high dimensional
space. Once the data are projected into the random vector space, linear
classification can perform well because points are more likely 
to be separated in a sparse vector space. The actual vector 
space may not be meaningful. We perform a similar experiment
in training the DKT model. We randomly initialize
the DKT model and only train the last linear layer ($W_y$ and
$b_y$) that maps the output of the LSTM, $h_t$, to the skill vector, $y_t$.
As shown in Table 1 and Table 2, the untrained recurrent
network performs similarly to the trained network.
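
The training protocol described above, in which the recurrent weights stay at their random initial values and only a final linear layer is fitted, can be sketched on a toy problem as follows. For brevity this sketch uses a plain tanh RNN rather than an LSTM and a single logistic output rather than a per-skill output vector; these simplifications, and all function names and hyperparameters, are assumptions for illustration and do not reproduce the experiment reported in the tables.

```python
import math
import random


def make_random_rnn(input_size, hidden_size, seed=0):
    """Randomly initialised recurrent weights that are never updated."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(hidden_size)
    W_in = [[rng.gauss(0, scale) for _ in range(input_size)]
            for _ in range(hidden_size)]
    W_rec = [[rng.gauss(0, scale) for _ in range(hidden_size)]
             for _ in range(hidden_size)]
    return W_in, W_rec


def run_rnn(W_in, W_rec, inputs):
    """Return the final hidden state of a plain tanh RNN over a sequence."""
    h = [0.0] * len(W_rec)
    for x in inputs:
        h = [math.tanh(sum(a * b for a, b in zip(W_in[j], x))
                       + sum(a * b for a, b in zip(W_rec[j], h)))
             for j in range(len(h))]
    return h


def predict(w, b, h):
    """Logistic readout applied to a hidden state."""
    return 1.0 / (1.0 + math.exp(-(sum(wi * hi for wi, hi in zip(w, h)) + b)))


def log_loss(w, b, states, labels):
    """Mean binary cross-entropy of the readout over all examples."""
    eps = 1e-12
    total = 0.0
    for h, y in zip(states, labels):
        p = predict(w, b, h)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(states)


def train_readout(states, labels, epochs=200, lr=0.1):
    """Fit only the linear readout (w, b) by stochastic gradient descent."""
    w, b = [0.0] * len(states[0]), 0.0
    for _ in range(epochs):
        for h, y in zip(states, labels):
            err = predict(w, b, h) - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * err * hi for wi, hi in zip(w, h)]
            b -= lr * err
    return w, b
```

Here `train_readout` only ever receives the hidden states, so the recurrent weights returned by `make_random_rnn` cannot change during training; any predictive performance comes from the random projection plus the fitted linear layer.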


4. CONCLUSION AND FUTURE WORK 


In this paper, we dive deep into the Deep Knowledge Trac- 
ing model. We have visualized and analyzed the behaviors 
of DKT through time using dimensionality reduction of the 
activation vector, $a_t$. We have also analyzed the temporal
sequence behavior of DKT using qualitative and quantitative
analyses. We find that the DKT model is most likely
learning an ‘ability’ model, rather than tracking each individual
skill. Moreover, DKT is significantly impacted by the
relative ordering of skills presented. We also discover that 
a randomly initialized DKT with only the final linear layer 
trained achieves similar results to the fully trained DKT 
model. In other words, the DKT model performance gains 
may stem from mapping input sequences into a random high 
dimensional vector space where linear classification is easier 
because the space is sparse. This is a worrying conclusion because
it means the underlying recurrent representation may
be neither reliable nor semantically meaningful. Several mitigating
measures are suggested in this paper, including the
use of a loss function that mitigates unwanted behaviors and 
the use of an attention model to better capture long term 
skill dependencies. We leave evaluation of these suggestions 
to future work in the educational data mining community. 